Author: Tim Graf
Date: 24.02.2022
In this Python notebook I show that an XGBoost model outperforms both a persistence model and multivariate regression at predicting the summed production of 20 wind farms.
I've looked at various dimensions to understand the data better.
First, I've realized how correlated all the assets are with each other and, to a lesser but still significant degree, with the wind forecast. This may indicate that the individual farms are located close to each other or that the wind is very similar across the locations.
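This pattern can be reproduced on synthetic data. The sketch below is illustrative only: the `mill_*` and `forc_1_ms` names mirror the notebook's wide production table and forecast column, but the numbers are simulated, not the real data.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the wide production table: three "mills" driven by
# a common wind signal, plus a noisier "forecast" of that signal
rng = np.random.default_rng(0)
wind = rng.gamma(2.0, 3.0, size=500)                 # shared wind driver
prod = pd.DataFrame({f'mill_{i}': wind + rng.normal(0, 1, 500)
                     for i in range(1, 4)})
prod['forc_1_ms'] = wind + rng.normal(0, 2, 500)     # forecast: noisier link

corr = prod.corr()
# mills share the driver, so they correlate strongly with each other,
# and somewhat less strongly with the noisier forecast
print(corr.loc['mill_1', 'mill_2'], corr.loc['mill_1', 'forc_1_ms'])
```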
Second, I've identified yearly seasonality, with summer months having less production than winter months. There is no such trend in the daily or weekly averages.
Third, there is a clear relationship between wind forecast and energy production. Production starts at around 4-7.5 m/s. It cuts off at around 2.5 MW, probably because for safety reasons a wind farm cannot produce more than this threshold. I also find anomalies where, given wind forecasts above 7.5 m/s, energy should be produced but is not.
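The shape described above resembles a classic turbine power curve. A stylized sketch of that relationship follows; the cut-in speed, rated speed, and rated power are illustrative assumptions chosen to match the scatter, not values taken from the data.

```python
import numpy as np

def toy_power_curve(v, cut_in=4.0, rated_speed=12.0, rated_power=2.5):
    """Stylized power curve: no output below the cut-in speed, a cubic
    ramp, and a cap at rated power (~2.5 MW). All parameters are
    illustrative assumptions."""
    v = np.asarray(v, dtype=float)
    p = rated_power * ((v - cut_in) / (rated_speed - cut_in)) ** 3
    p = np.where(v < cut_in, 0.0, p)                  # below cut-in: no output
    return np.minimum(p, rated_power)                 # cap at rated power

print(toy_power_curve([2.0, 6.0, 12.0, 20.0]))
```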
Fourth, I've spotted outliers in the form of negative or zero production despite high wind forecasts. This may have a variety of causes, such as a sensor malfunction, maintenance or an upgrade of the production asset, or an erroneous wind forecast.
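Such rows can be flagged with a simple boolean mask. This is a minimal sketch on a hypothetical frame: the column names mirror the notebook's, and the 7.5 m/s threshold is an assumption taken from the anomaly discussion above.

```python
import pandas as pd

# hypothetical frame; in the notebook the columns are the mill ids and forc_1_ms
df = pd.DataFrame({'production': [2.1, 0.0, -0.1, 1.8, 0.0],
                   'forc_1_ms':  [9.0, 10.5, 8.2, 7.9, 2.0]})

# flag rows where the forecast is well above cut-in but production is <= 0:
# candidates for sensor faults, maintenance, or bad forecasts
mask = (df['forc_1_ms'] > 7.5) & (df['production'] <= 0.0)
print(df[mask])
```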
| Dataset A | Dataset B | Dataset C | Dataset D |
|---|---|---|---|
| 2 Wind Forecasts | 2 Wind Forecasts | 2 Wind Forecasts | 2 Wind Forecasts |
| 20 Wind Productions | 20 Wind Productions - Cleaned | 20 Wind Productions - Cleaned | 20 Wind Productions - Cleaned |
| - | - | Min, Max, Avg for each feature | Min, Max, Avg for each feature |
| - | - | - | Lagged Features & Rolling Means |
| Datetime | Datetime | Datetime | Datetime |
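The lagged features and rolling means of Dataset D can be built with pandas `shift` and `rolling`. A minimal sketch on a hypothetical hourly series (the name `prod_sum` and the lag/window choices are illustrative):

```python
import numpy as np
import pandas as pd

# hypothetical hourly series standing in for the summed production;
# Dataset D adds features like these on top of the cleaned columns
idx = pd.date_range('2015-01-01', periods=6, freq='h')
s = pd.Series(np.arange(6, dtype=float), index=idx, name='prod_sum')

feat = pd.DataFrame({'prod_sum': s,
                     'lag_1': s.shift(1),            # previous hour
                     'roll_3_mean': s.rolling(3).mean()})  # trailing 3h mean
print(feat)
```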
Time series can be analyzed using multiple methods:
XGBoost has been used and shown to outperform simple persistence models in the context of wind production prediction. See for example:
XGBoost has several benefits: a) it handles missing data, b) it works well without normalizing the data, c) it captures non-linear relationships, d) it is computationally efficient, and e) it performs well on small datasets compared to neural networks.
I use both persistence and simple linear regression as benchmark models. I then compare my XGBoost model against these benchmarks using the RMSE (root mean squared error) loss function. Finally, I compare the performance of XGBoost across the different input datasets. I have tuned the hyperparameters only once, as tuning is computationally very heavy and I want a fair comparison between the individual datasets.
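The benchmark setup can be sketched as follows. This is a minimal illustration on synthetic data (independent draws, so persistence performs poorly here by construction; on real autocorrelated production data it is a much stronger baseline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic hourly data: wind forecast plus a capped, noisy production response
rng = np.random.default_rng(1)
n, split = 500, 400
wind = rng.gamma(2.0, 3.0, size=n)
y = np.minimum(np.clip(wind - 4, 0, None), 2.5) + rng.normal(0, 0.1, n)

# persistence benchmark: the next hour repeats the current value
persist_pred = y[split - 1:n - 1]
# linear-regression benchmark on the wind forecast
lr = LinearRegression().fit(wind[:split].reshape(-1, 1), y[:split])
lr_pred = lr.predict(wind[split:].reshape(-1, 1))

rmse = lambda p: float(np.sqrt(mean_squared_error(y[split:], p)))
print(f"persistence RMSE: {rmse(persist_pred):.3f}")
print(f"linear regression RMSE: {rmse(lr_pred):.3f}")
```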
I have begun decomposing the time series to fit a vector autoregression (VAR), but due to time constraints I was unfortunately unable to complete it. Nevertheless, the code is attached as 04_var.py.
# It is recommended to set up a virtual environment
'''
python3 -m venv /path/to/new/virtual/environment
cd /path/to/new/virtual/environment/bin
source activate
'''
# install needed packages
'''
pip3 install -r requirements.txt
'''
import datetime
import glob
import sklearn
import matplotlib.pyplot as plt
import math
import numpy as np
import os
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.api import VAR
from sklearn.metrics import mean_squared_error as mse, r2_score
import seaborn as sns
import warnings
#plt.rcParams.update({'figure.figsize': (9, 6), 'figure.dpi': 120})
### import production data ###
prod1 = pd.read_csv('data/power_production_1.csv', delimiter=';')
prod1['datetime'] = pd.to_datetime(prod1['datetime'], format='%Y%m%d%H%M%S')
print(type(prod1['datetime']))
prod1['year'] = prod1['datetime'].dt.year
prod1['month'] = prod1['datetime'].dt.month
prod1['day'] = prod1['datetime'].dt.day
prod1['hour'] = prod1['datetime'].dt.hour
print(len(prod1) / 24)
print(prod1['datetime'].iloc[-1] - prod1['datetime'].iloc[0])
print(f"2015: {prod1[prod1['year'] == 2015]['datetime'].iloc[-1] - prod1[prod1['year'] == 2015]['datetime'].iloc[0]}")
print(f"2016: {prod1[prod1['year'] == 2016]['datetime'].iloc[-1] - prod1[prod1['year'] == 2016]['datetime'].iloc[0]}")
print(f"2017: {prod1[prod1['year'] == 2017]['datetime'].iloc[-1] - prod1[prod1['year'] == 2017]['datetime'].iloc[0]}")
<class 'pandas.core.series.Series'>
828.9583333333334
828 days 22:00:00
2015: 364 days 23:00:00
2016: 365 days 23:00:00
2017: 97 days 22:00:00
# get all data file names (forecasts are handled separately below)
path = 'data'
all_files = glob.glob(os.path.join(path, '*.csv'))
all_files.remove(os.path.join(path, 'forecasts.csv'))
# read all files and append each dataframe to a list
temp_list = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, delimiter=';')
    temp_list.append(df)
# concatenate all files into one long dataframe
df_long = pd.concat(temp_list, axis=0, ignore_index=True)
df_long['datetime'] = pd.to_datetime(df_long['datetime'], format='%Y%m%d%H%M%S')
# reshape to wide
prod_data = df_long.pivot(index='datetime', columns='mill_id', values='production')
prod_data.dropna(inplace=True)
# create folder if not yet created
if not os.path.exists('data_processed'):
    os.makedirs('data_processed')
# save to csv
prod_data.to_csv('data_processed/production.csv')
# range of dataset
print(f"Dataset ranges from {prod_data.index[0]} to {prod_data.index[-1]} and a timedelta of {prod_data.index[-1] - prod_data.index[0]}")
print(f"We have {len(prod_data)} rows, hence {prod_data.index[-1] - prod_data.index[0]} days, {len(prod_data)} hours")
print(f"Sanity check: 828 days * 24h hours + 23 = {828*24+23}")
Dataset ranges from 2015-01-01 00:00:00 to 2017-04-08 21:00:00 and a timedelta of 828 days 21:00:00
We have 19891 rows, hence 828 days 21:00:00 days, 19891 hours
Sanity check: 828 days * 24h hours + 23 = 19895
# Import forecasts
forc_long = pd.read_csv('data/forecasts.csv', delimiter=';')
forc_long['datetime'] = pd.to_datetime(forc_long['datetime'], format='%Y%m%d%H%M%S')
forc_wide = forc_long.pivot(index='datetime', values='windspeed_forecast', columns='location_id')
# shift forecast to have prediction of same row
forc_wide = forc_wide.shift(periods=-4)
forc_wide.rename(columns={1:'forc_1_ms', 2:'forc_2_ms'}, inplace=True)
forc_wide.dropna(inplace=True)
# save to csv
forc_wide.to_csv('data_processed/forecasts.csv')
print(f"Forecast data is {forc_wide.index[-1] - forc_wide.index[0]} long")
print(f"This is {forc_wide.index[-1] - prod_data.index[-1]} longer than our production data")
Forecast data is 838 days 09:00:00 long
This is 9 days 12:00:00 longer than our production data
# # to save time I import the already computed datasets
# dataset_a = pd.read_csv('data_processed/dataset_a.csv', index_col='datetime')
# dataset_b = pd.read_csv('data_processed/dataset_b.csv', index_col='datetime')
# dataset_c = pd.read_csv('data_processed/dataset_c.csv', index_col='datetime')
# dataset_d = pd.read_csv('data_processed/dataset_d.csv', index_col='datetime')
# dataset_e = pd.read_csv('data_processed/dataset_e.csv', index_col='datetime')
# # convert back to index
# dataset_a.index = pd.to_datetime(dataset_a.index)
# dataset_b.index = pd.to_datetime(dataset_b.index)
# dataset_c.index = pd.to_datetime(dataset_c.index)
# dataset_d.index = pd.to_datetime(dataset_d.index)
# dataset_e.index = pd.to_datetime(dataset_e.index)
# # define for later
# cols = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20']
# data = dataset_a.copy().drop(columns=['prod_sum']).dropna()
# # split production and forecast
# data_prod = data[cols]
# data_forc = data[['forc_1_ms', 'forc_2_ms']]
# merge forecasts and production
data = pd.merge(forc_wide, prod_data, how='left', left_index=True, right_index=True)
data.dropna(inplace=True)
data_days = data.resample(rule='d').mean()
data_prod = prod_data
days_avg = data.resample(rule='d').mean()
weeks_avg = data.resample(rule='w').mean()
month_avg = data.resample(rule='m').mean()
data_days_avg = data_prod.resample(rule='d').mean()
data_weeks_avg = data_prod.resample(rule='w').mean()
data_months_avg = data_prod.resample(rule='m').mean()
data_prod_days_sum = data_prod.resample(rule='d').sum()
data_prod_weeks_sum = data_prod.resample(rule='w').sum()
data_prod_months_sum = data_prod.resample(rule='m').sum()
This part serves to better understand the data before starting to model.
width = 10
height = 7
data_prod_weeks_sum.plot(title='Weekly Sums of Production', figsize=(width*2, height))
plt.tight_layout()
Conclusion:
data[['forc_1_ms', 'forc_2_ms']].resample(rule='w').mean().plot(figsize=(width*2, height))
plt.title('Weekly Averages of Forecast')
plt.ylabel('Windspeed in m/s')
plt.tight_layout()
# first 100 hours
ax = data.iloc[:100, ].plot(figsize=(width*2, height), secondary_y=['forc_1_ms', 'forc_2_ms'], title='production during first 100 hours')
ax.set_ylabel('Production in MW')
ax.right_ax.set_ylabel('Wind Speed in m/s')
plt.tight_layout()
# other 100 hours
ax = data.iloc[1900:2000, ].plot(figsize=(width*2, height), secondary_y=['forc_1_ms', 'forc_2_ms'], title="production between hours 1'900 and 2'000")
ax.set_ylabel('Production in MW')
ax.right_ax.set_ylabel('Wind Speed in m/s')
plt.tight_layout()
#other 100 hours
ax = data.iloc[10000:10100, ].plot(figsize=(width*2, height), secondary_y=['forc_1_ms', 'forc_2_ms'], title="production between hours 10'000 and 10'100")
ax.set_ylabel('Production in MW')
ax.right_ax.set_ylabel('Wind Speed in m/s')
plt.tight_layout()
#other 100 hours
ax = data.iloc[18000:18100, ].plot(figsize=(width*2, height), secondary_y=['forc_1_ms', 'forc_2_ms'], title="production between hours 18'000 and 18'100")
ax.set_ylabel('Production in MW')
ax.right_ax.set_ylabel('Wind Speed in m/s')
plt.tight_layout()
Box plots give a good overview of the distributions.
df_2015 = data_prod.loc['2015-01-01 00:00:00':'2015-12-31 23:00:00', ]
df_2016 = data_prod.loc['2016-01-01 00:00:00':'2016-12-31 23:00:00', ]
df_2017 = data_prod.loc['2017-01-01 00:00:00':'2017-12-31 23:00:00', ]
# df_2015.boxplot(figsize=(width, height))
# plt.title('2015 boxplot')
fig, ax = plt.subplots(2, 2, figsize=(30, 15))
ax[0, 0].boxplot(data_prod)
ax[0, 0].set_title('Whole Dataset')
ax[0, 1].boxplot(df_2015)
ax[0, 1].set_title('2015 boxplot')
ax[1, 0].boxplot(df_2016)
ax[1, 0].set_title('2016 boxplot')
ax[1, 1].boxplot(df_2017)
ax[1, 1].set_title('2017 boxplot')
# Hourly Data
sns.pairplot(prod_data)